Finding needles in haystacks: 
Identity mismatch frequency and facial identity verification 



Markus Bindemann, Meri Avetisyan & Kristy-Ann Blackwell 
Department of Psychology, University of Essex, UK. 



Correspondence to: 

Markus Bindemann, Department of Psychology, University of Essex, CO4 3SQ, UK. 

Email: markus@essex.ac.uk 

Tel: +44 (0)1206 873356 

Fax: +44 (0)1206 873801 



Word count (excluding abstract, references and figure captions): 6874 



Abstract 

Accurate person identification is central to all security, police and judicial systems. A 
commonplace method to achieve this is to compare a photo-ID and the face of its 
purported owner. The critical aspect of this task is to spot cases in which these two 
instances of a face do not match. Studies of person identification show that these 
instances often go undetected when mismatches occur regularly in an experiment, but 
this differs from everyday operations in which identity mismatches are rare. The 
current study therefore examined whether infrequent identity mismatches are more 
likely to go undetected by observers. In Experiments 1 and 2, identity mismatches 
were detected equally under low (2%) and high (50%) mismatch prevalence. This 
pattern persisted when viewing conditions were optimized for person identification in 
Experiment 3, by using a card-sorting task in which all face identities could be viewed 
repeatedly, and also under increased task difficulty, by constraining viewing 
conditions temporally in Experiment 4. These results imply that the infrequent 
occurrence of identity mismatches in security settings such as passport control does 
not impair an observer's ability to detect these important events. 

Keywords: unfamiliar, face, person, identification, mismatch, infrequent 



Introduction 

The identification of unfamiliar persons refers to the process by which we 
correctly associate different images, or different encounters, of someone with whom 
we have limited perceptual experience. The human face provides the most reliable 
means to identify a person under these conditions (Burton, Wilson, Cowan, & Bruce, 
1999), but there is now substantial evidence that even unfamiliar face identification 
remains highly error prone. Much of this research has focused on examining memory 
for faces in the context of eyewitness identification accuracy (see, e.g., Lane & 
Meissner, 2008; Lindsay & Pozzulo, 1999; Malpass & Devine, 1981; Searcy, Bartlett, 
& Memon, 1999; Wells & Olson, 2003; Westcott & Brace, 2002). However, in recent 
years it has become clear that unfamiliar face identification remains difficult even 
when memory factors are eliminated. This has been assessed with matching tasks, in 
which observers have to decide if two photographs show the same person's face or two 
different people (e.g., Burton, White, & McNeill, 2010; Kemp, Towell, & Pike, 1997; 
Megreya & Bindemann, 2009). These matching tasks are of great practical relevance. 
Proof of identity at passport control, for example, is typically achieved by comparing a 
person's facial appearance with a photograph contained in an identity document. The 
problem of passport control is therefore one of face matching. 

There is now considerable evidence demonstrating the difficulty of unfamiliar 
face matching, including field studies of live interactions (e.g., Behrmann & Davey, 
2001; Kemp, Towell, & Pike, 1997; Simons & Levin, 1998; Valentine & Mesout, 
2009; Yarmey, 2004), identification from video and CCTV (e.g., Bruce et al., 1999; 
Davis & Valentine, 2009; Henderson, Bruce, & Burton, 2001; Levin & Simons, 1997), 
and lab-based scenarios (e.g., Burton et al., 2010; Megreya & Burton, 2006, 2008; 
Memon & Gabbert, 2003). These studies show that even under best-case conditions 
observers average between 10% and 30% errors in person identification, depending on 
task demands (see, e.g., Megreya & Burton, 2006; Megreya & Bindemann, 2009; 
Clutterbuck & Johnston, 2002). An exploration of individual differences in the ability 
to match unfamiliar faces emphasizes the difficulty of this task further, with some 
observers performing close to chance even under seemingly optimal viewing 
conditions (Burton et al., 2010). 

The inaccuracy with which unfamiliar faces are identified can have dramatic 
consequences and there is now an acknowledgement that psychological research in this 
field is of importance for police, security, and judicial settings (see, e.g., Costigan, 
2007; Jenkins & Burton, 2008; Memon, Vrij, & Bull, 2003; Wells & Olson, 2003). 
However, the relevance of this research to real-life scenarios often depends on the 
methodological details of a study, which are usually tailored to a specific problem at 
hand. In the present study, we consider a manipulation that is important for estimating 
the accuracy of face identification in security settings that require the routine 
verification of a person's identity from face images. The manipulation under 
investigation here derives from the observation that previous studies in this field have 
provided an equal ratio of match to mismatch encounters in face matching 
experiments. In these studies, the probability that a pair of to-be-compared faces 
depicts two different people is always equivalent to the probability that both face 
images depict the same person (see, e.g., Burton et al., 2010; Kemp et al., 1997; 
Megreya & Bindemann, 2009; Megreya & Burton, 2006, 2007). Everyday life rarely 
provides such clear odds; passport and police officers, for example, may encounter an 
identity mismatch only very infrequently, as the provided identity documents will 
usually match the individual at hand. This raises an important question for estimating 
baseline accuracy of face identification in such applied settings, namely whether the 
detection of identity mismatches is affected by the frequency with which these 
mismatches occur. 

This issue is important for two reasons. First, there is reason to predict that 
face identification accuracy might decline further when identity mismatches are the 
exception rather than the norm. For example, visual search studies, in which observers 
look for a predefined target, show that error rates soar alarmingly if a target appears 
only infrequently in an experiment (Wolfe, Horowitz, & Kenner, 2005; Wolfe et al., 
2007; Wolfe & Van Wert, 2010). In a compelling demonstration of this effect, Wolfe 
et al. (2005) showed that a 50% target prevalence in an airport-style baggage-screening 
task produces 7% target misses, but error rates soar to 30% with only 1% target 
prevalence. If a similar performance decrement applies to the problem of face 
matching, then this will have obvious implications for the utility of this task in security 
settings such as passport control. It is therefore important to assess identification 
accuracy under these precise conditions. 

The second reason for manipulating the ratio of mismatch-to-match encounters 
is that varying mismatch frequency may reveal something about face perception more 
generally. In visual search tasks, high target prevalence not only results in fewer target 
misses but also produces more false alarms (see, e.g., Wolfe & Van Wert, 2010). 
Varying target frequency therefore appears to induce a criterion shift in observers that 
improves some aspects of task performance but impairs others. Manipulating 
mismatch frequency can therefore also reveal whether observers apply a stable 
identification criterion in unfamiliar face processing or whether this task is subject to 
context-based criterion shifts, despite our extensive visual experience with faces. 
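
The notion of a criterion shift can be made concrete with standard signal detection measures. The following is a minimal sketch, not an analysis reported in this study: it assumes that a detected mismatch counts as a hit and that a match pair classified as a mismatch counts as a false alarm, and the function name `sdt_measures` and the rates passed to it are purely illustrative.

```python
from statistics import NormalDist


def sdt_measures(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion) for one observer or condition.

    Assumption for illustration: 'signal' = identity mismatch, so a hit is
    a detected mismatch and a false alarm is a match called a mismatch.
    Rates of exactly 0 or 1 are not handled (inv_cdf raises an error).
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(false_alarm_rate)
    criterion = -0.5 * (z(hit_rate) + z(false_alarm_rate))
    return d_prime, criterion


# Hypothetical rates, not data from the experiments reported here.
neutral = sdt_measures(0.80, 0.20)   # symmetric errors: criterion near 0
liberal = sdt_measures(0.90, 0.35)   # more hits AND more false alarms
print(neutral, liberal)
```

On this reading, a pure criterion shift changes hits and false alarms in the same direction while sensitivity (d') stays roughly constant, which is the pattern described for visual search under varying target prevalence.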

To investigate the role of mismatch frequency, the experiments reported here 
employed a two-alternative forced-choice matching task (2AFC) in which observers 
had to decide if pairs of unfamiliar faces depict one person or two different people 
(see, e.g., Burton et al., 2010; Clutterbuck & Johnston, 2002; Megreya & Bindemann, 
2009; Megreya & Burton, 2006, 2007). Using the 2AFC task, the general approach in 
each experiment was the same. Observers were given a high mismatch prevalence 
version of this task, in which match and mismatch trials were equally likely over the 
course of the experiment. This condition is consistent with previous studies in this 
field and therefore provides a 50:50 match-to-mismatch ratio. In a departure from 
previous studies, observers were also presented with a low mismatch prevalence 
version in which only a solitary mismatch trial was shown, equating to 2% mismatch 
prevalence (and 98% matches) in these experiments. The contrast between the high 
and low prevalence conditions was then used to assess the extent to which these 
frequencies affect mismatch detection. 

Varying mismatch frequency in this manner imposes limitations on data 
analysis. To ensure that performance under high and low mismatch prevalence was 
comparable, the experiments included a subset of critical mismatch stimuli. In the low 
prevalence condition, the solitary mismatch was always drawn from this subset of 
stimuli and presented in the context of forty-nine match trials. In the high prevalence 
condition, each participant was presented with a critical mismatch, twenty-four 
noncritical mismatches, and twenty-five matches. To determine the effect of mismatch 
frequency on face matching accuracy, performance was then compared for the critical 
mismatch in the high prevalence condition with the solitary mismatch in the low 
prevalence condition. Using this methodology, if identity mismatch detection is 
impaired by low mismatch frequency, then the critical mismatches should be more 
frequently classified incorrectly as identity matches in the context of the low 
prevalence task in comparison with the high prevalence condition. 



Experiment 1 

This experiment examined face matching performance under low and high 
mismatch prevalence. In the low mismatch condition, observers were shown a block of 
forty-nine match pairs and a solitary critical mismatch pair, which was always 
presented in the last trial of the block to preserve a mismatch frequency of 2%. This 
was followed by a block of the high prevalence condition, which consisted of twenty- 
five match and twenty-four mismatch pairs, which were randomly intermixed. These 
trials were immediately followed by a critical mismatch pair, which, consistent with 
the low prevalence condition, was also always presented last to preserve an overall 
mismatch frequency of 50%. The approximate probabilities of identity mismatches 
occurring in these conditions were given in advance, by alerting observers that 
mismatches would only occur very infrequently in the first block, but would then 
occur with the same frequency as the match trials in the second part. The aim of this 
manipulation, therefore, was to examine whether identification accuracy of the critical 
face mismatches is impaired under low mismatch prevalence in comparison to the high 
prevalence condition. 
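
The composition of the two blocks can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the software used in the experiments: the function names, the trial labels, and the fixed random seed are hypothetical.

```python
import random


def build_block(n_match, n_noncritical, rng):
    """Shuffle match and noncritical mismatch trials, then append the
    critical mismatch so that it is always the last trial of the block."""
    trials = ["match"] * n_match + ["noncritical_mismatch"] * n_noncritical
    rng.shuffle(trials)
    trials.append("critical_mismatch")
    return trials


def mismatch_prevalence(trials):
    """Proportion of trials in the block that are identity mismatches."""
    return sum(t != "match" for t in trials) / len(trials)


rng = random.Random(0)  # hypothetical seed, for reproducibility only
low_block = build_block(n_match=49, n_noncritical=0, rng=rng)
high_block = build_block(n_match=25, n_noncritical=24, rng=rng)

print(mismatch_prevalence(low_block))   # 1 mismatch in 50 trials
print(mismatch_prevalence(high_block))  # 25 mismatches in 50 trials
```

Placing the critical mismatch last means that every preceding trial contributes to the prevalence context before the critical decision is made.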

Method 
Participants 

Fifty-four undergraduate students (9 males) from the University of Essex 
volunteered to participate in the experiment. The participants had a mean age of 20.9 
years (SD = 5.6 years) and all had normal or corrected-to-normal vision. 

Stimuli and Procedure 



The stimuli consisted of 74 match (37 male pairs, 37 female pairs) and 30 
mismatch pairs from the Glasgow University Face Database (GUFD) (see Burton et 
al., 2010). The 30 mismatch pairs were broken down further into 24 noncritical 
mismatch pairs (12 male, 12 female) and 6 critical mismatch pairs (3 male and 3 
female). The match and mismatch face pairs were constructed so that faces were 
shown in grayscale on a plain white background. Each face was depicted in full-face 
view and with a neutral expression, and measured at most 350 pixels in width at a 
screen resolution of 72 ppi. The faces in a pair were positioned in such a way that the 
horizontal distance between the centers of the two faces measured 500 pixels. In each 
match and mismatch face pair, one face image was taken with a high-quality digital 
camera. The other image was a frame of a person's face taken from high-quality video 
footage (for details, see Burton et al., 2010). For each person in the GUFD database, 
these two images were taken on the same day, roughly 15 minutes apart. The resulting 
match pairs therefore provide similar but not identical face images of a person, to 
ensure that the task cannot be performed using simple image matching processes (see 
Hancock, Bruce, & Burton, 2000; Megreya & Burton, 2006). 

The GUFD also provides similarity ratings for all faces in the database. These 
ratings were used to construct viewing conditions in which a high mismatch 
prevalence context could be clearly set, but in which the detection of the critical 
identity mismatch was not trivial. This was achieved by selecting the face identities 
with the highest similarity ratings for the critical mismatch pairs, while faces of lower 
similarity were chosen for the noncritical mismatch pairs of the high mismatch 
prevalence condition. The average similarity ratings for these face pairs were 0.20/1 
(SD = 0.10) and 0.56/1 (SD = 0.04), respectively. Example stimuli are depicted in 
Figure 1. 






FIGURE 1 HERE PLEASE 



In the experiment, match and mismatch pairs were then combined in the 
following manner. In the low prevalence condition, observers were shown 49 match 
and 1 critical mismatch pair. In the high prevalence condition, observers were shown 
25 mismatch pairs, consisting of 1 critical and 24 noncritical mismatches, and 25 
match pairs. The match stimuli were rotated across participants, so that they were 
equally likely to occur in the low and high prevalence condition over the course of the 
experiment. Similarly, the six critical mismatch pairs were rotated across participants 
so that each of these stimuli was seen an equal number of times in the low and high 
prevalence condition. The critical mismatches were rotated in this manner with the 
restriction that each participant saw one male and one female face pair. The presentation of 
the match and noncritical mismatch pairs was randomized for each participant, but the 
critical mismatches were always shown on the last trial of a block in the low and high 
prevalence condition. This was done to ensure that the mismatch frequency, provided 
by the context of the preceding trials, was kept constant at 2% and 50% for low and 
high prevalence conditions, respectively. 
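
One way to implement such a rotation is sketched below. The exact counterbalancing scheme is not specified in the text, so this is an assumption: the pair labels (`M1` to `F3`) and function names are hypothetical, and the scheme simply cycles through all 18 combinations of one male and one female critical pair assigned to the two prevalence conditions.

```python
from collections import Counter
from itertools import product

MALE_PAIRS = ["M1", "M2", "M3"]      # hypothetical labels for the 3 male critical pairs
FEMALE_PAIRS = ["F1", "F2", "F3"]    # hypothetical labels for the 3 female critical pairs


def all_assignments():
    """Every way to give one male and one female critical pair to the
    low and high prevalence conditions (3 x 3 x 2 = 18 combinations)."""
    combos = []
    for m, f in product(MALE_PAIRS, FEMALE_PAIRS):
        combos.append({"low": m, "high": f})
        combos.append({"low": f, "high": m})
    return combos


def assign(participant_index):
    """Cycle through the 18 combinations in a fixed order."""
    combos = all_assignments()
    return combos[participant_index % len(combos)]


# Tally how often each pair appears in each condition for 54 participants.
counts = Counter()
for p in range(54):
    a = assign(p)
    counts[("low", a["low"])] += 1
    counts[("high", a["high"])] += 1
print(counts)
```

With any multiple of 18 participants, this scheme shows each critical pair equally often under low and high prevalence, while every participant sees one male and one female critical pair.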

In the experiment, participants were shown a central fixation cross for 1 
second, followed by a face display, which remained onscreen until a response was 
registered. Participants were instructed to decide whether a matching or mismatching 
identity pair was shown by pressing the corresponding buttons on a computer 
keyboard. A blocked design was used so that participants were always shown the low 
prevalence block first, followed by the high prevalence condition. Prior to the first 
block, participants were informed that the task provides a lab-based analogue to 
passport control and that identity mismatches will therefore only occur very 
infrequently. Prior to the second block, participants were then informed that match and 
mismatch trials would now occur with equal frequency. Accuracy was emphasized but 
participants were given no feedback on their responses and no indication of the length 
of the experimental blocks. 

Results 

Table 1 shows the mean percentage accuracy for the high and low prevalence 
conditions. The comparison of primary interest is the difference between the critical 
identity mismatches for these conditions. These data show that more of these 
mismatches were detected under low prevalence than in the high prevalence condition, 
Wilcoxon signed-rank test, Z = 2.32, p < 0.05, r = 0.32. Contrary to the prediction that 
low mismatch frequency might reduce mismatch detection accuracy, these results 
therefore show that low mismatch prevalence improved the detection of identity 
mismatches in this experiment. 
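
The effect size r reported alongside the Z statistics is numerically consistent with the common conversion r = Z / sqrt(N), where N is the total number of observations. The text does not state which formula was used, so the sketch below is an assumption that happens to reproduce the reported value; the function name is illustrative.

```python
from math import sqrt


def effect_size_r(z, n):
    """Convert a Z statistic to r using r = Z / sqrt(N).

    Assumption: N is the number of participants contributing to the test.
    """
    return z / sqrt(n)


# Experiment 1: Z = 2.32 with 54 participants.
r_exp1 = round(effect_size_r(2.32, 54), 2)
print(r_exp1)
```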

TABLE 1 HERE PLEASE 



Face matching accuracy was also analyzed for the match and noncritical 
mismatch pairs. In the low prevalence condition, performance on match trials was 
highly error-prone, with observers erroneously classifying these stimuli as mismatches 
on a quarter of trials. By comparison, observers were considerably more accurate on 
match trials under high mismatch prevalence, t(53) = 5.85, p < 0.01, d = 0.82. This 
difference indicates that low mismatch prevalence reduces accuracy on match trials 
compared to the high prevalence condition, by generating an increase in the number of 
match pairs that are falsely perceived as identity mismatches. Finally, accuracy was 
comparable for match and mismatch trials in the high prevalence condition, t(53) = 
0.31, p = 0.76, d = 0.05. 

For completeness, the mean correct response times were also analyzed for 
match and mismatch trials, although only accuracy was emphasized in this task. In the 
low prevalence condition, observers responded in 6.1 sec (SD = 4.5) to match pairs and 
in 5.6 sec (SD = 4.6) to the critical mismatch pairs. In the high prevalence condition, 
match pairs were classified in 5.8 sec (SD = 4.5), mismatch pairs in 5.8 sec (SD = 5.3), 
and critical mismatch pairs in 5.0 sec (SD = 4.3). Paired t-tests showed that the 
difference between the two match conditions, t(53) = 1.13, p = 0.26, d = 0.15, and also 
between the critical mismatch conditions, t(31) = 1.20, p = 0.24, d = 0.20, was not 
significant. Under high mismatch prevalence, response times to match and mismatch 
pairs also did not differ, t(53) = 0.20, p = 0.85, d = 0.03. These data therefore show 
that any effects in accuracy cannot be explained in terms of a speed-accuracy trade-off. 

Discussion 

This experiment provides initial evidence that mismatch detection in unfamiliar 
face identification is not impaired by low mismatch frequency. These results therefore 
suggest that matching face pairs for identity differs from other tasks, such as visual 
search, in which low target frequency impairs target detection (Wolfe et al., 2005; 
Wolfe et al., 2007; Wolfe & Van Wert, 2010). These results also provide preliminary 
evidence that previous studies of unfamiliar face identification, in which match and 
mismatch trials occurred with equal frequency, do not underestimate the difficulty of 
mismatch detection (e.g., Burton et al., 2010; Kemp et al., 1997). In fact, considering 
that detection of the critical mismatches was less accurate under high prevalence, the 
current findings suggest that previous studies may have even underestimated mismatch 
detection accuracy. 

Experiment 1 also points to a possible explanation for this mismatch frequency 
effect. This derives from the observation that the noncritical mismatch pairs, in which 
the faces bore relatively lower similarity to each other, were more likely to be detected 
in the high prevalence condition than the higher-similarity critical mismatches. By 
comparison, the same critical mismatches were detected more frequently when 
mismatch prevalence was low. This suggests that the context of the lower-similarity 
noncritical mismatches provokes an increased perception of an identity match for 
higher-similarity mismatches. At the same time, these high-similarity mismatches may 
not bear sufficient facial resemblance to identity matches, and therefore stand out 
under low mismatch prevalence. These findings therefore demonstrate a context effect 
on face matching performance. This context effect is not manifested in the expected 
impairment of mismatch identification when these occur infrequently, but in an 
improvement in mismatch identification accuracy under these conditions. 

It is notable, however, that the context provided by the low prevalence 
manipulation also resulted in reduced accuracy on match trials in Experiment 1, 
compared to the high prevalence condition. This is potentially problematic as it 
suggests that observers perceived a higher proportion of mismatches under low 
prevalence than were actually present. This raises the possibility that these additional 
perceived mismatches could have offset, at least in part, the effect of the low mismatch 
frequency manipulation. This issue is investigated further in Experiment 2. 

Experiment 2 






In Experiment 1, low prevalence did not reduce face mismatch detection but 
increased error rates on match trials. This raises the possibility that these perceived 
identity mismatches partially offset the effect of low mismatch frequency. In an 
attempt to address this question, the order of the low and high prevalence tasks was 
reversed in Experiment 2, so that the latter condition was always presented first. This 
manipulation is motivated by the observation that fewer identification errors were 
made on match trials under high mismatch prevalence in Experiment 1. This suggests 
that the regular contrast between match and mismatch trials may allow observers to 
gain a more accurate criterion to detect identity matches under these conditions. If this 
proves to be the case, then match accuracy under low mismatch prevalence should 
improve if observers are presented first with the high prevalence condition. It is less 
clear, however, how this will affect detection of the critical mismatches. This is 
examined in Experiment 2. 

Method 
Participants 

Thirty-six undergraduate students (6 males) from the University of Essex 
volunteered to participate in the experiment. The participants had a mean age of 20.2 
years (SD = 4.1 years). All had normal or corrected-to-normal vision and none had 
participated in Experiment 1. 

Stimuli and Procedure 

Stimuli and procedure were identical to Experiment 1, except that the high 
mismatch prevalence condition was now shown first, followed by the low mismatch 
prevalence block. 






Results 

Matching performance for all conditions is summarized in Table 1. 
Performance for the critical mismatches was high and was equivalent under high and 
low prevalence, Z = 0.00, p = 1.00, r = 0.00. This occurred in a context in which 
accuracy on match trials was now also high under low mismatch prevalence, and was 
comparable to match accuracy under high mismatch prevalence, t(36) = 0.77, p = 0.45, 
d = 0.13. Taken together, these results suggest that presenting match and mismatch 
trials in equal proportion first improves the detection of the critical mismatches under 
high prevalence, and also improves observers' criterion for detecting identity matches 
in a subsequent low mismatch prevalence task. 

These observations receive further support from a between-subject analysis 
with Experiment 1. This comparison shows that critical mismatches were detected 
more frequently in the high prevalence condition of Experiment 2, Mann-Whitney 
Test, Z = 1.90, p = 0.06, r = 0.20, while match accuracy under low mismatch 
prevalence was now also improved, t(88) = 4.09, p < 0.01, d = 0.88. Importantly, 
despite this improvement in match performance, the detection of critical mismatches 
remained similarly high to Experiment 1 in the low prevalence condition, Z = 0.16, p = 
0.87, r = 0.02 (for a summary of all comparisons, see Table 2). 
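
For the between-subject t-tests, the reported values of Cohen's d are numerically consistent with the conversion d = t * sqrt(1/n1 + 1/n2). The text does not state how d was computed, so the following sketch is an assumption that reproduces the reported figure; the function name is illustrative, and the group sizes are the participant numbers of Experiment 1 (54) and Experiment 2 (36).

```python
from math import sqrt


def cohens_d_from_t(t, n1, n2):
    """Cohen's d for two independent groups, recovered from the t statistic.

    Assumption: d = t * sqrt(1/n1 + 1/n2), with degrees of freedom n1 + n2 - 2.
    """
    return t * sqrt(1 / n1 + 1 / n2)


# Match accuracy under low prevalence, Experiment 1 vs. Experiment 2.
degrees_of_freedom = 54 + 36 - 2
d_match = round(cohens_d_from_t(4.09, 54, 36), 2)
print(degrees_of_freedom, d_match)
```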

TABLE 2 HERE PLEASE 



Finally, as in Experiment 1, the mean correct response times were analyzed to 
rule out any speed-accuracy trade-offs. In the low prevalence condition, match 
decisions were made in 5.6 sec (SD = 3.7) while critical mismatch pairs were classified 
in 8.1 sec (SD = 6.0). In the high prevalence condition, match pairs were classified in 
5.3 sec (SD = 3.1), mismatch pairs in 5.4 sec (SD = 4.3), and critical mismatch pairs in 
8.1 sec (SD = 6.2). Paired t-tests showed that the difference between the two match 
conditions, t(35) = 1.23, p = 0.23, d = 0.22, between the critical mismatch conditions, 
t(29) = 0.02, p = 0.98, d = 0.00, and between match and mismatch pairs under high 
mismatch prevalence, t(35) = 0.28, p = 0.78, d = 0.05, were not significant. 

Discussion 

In Experiment 2, identity mismatches were detected with equal proficiency 
under low and high prevalence, while accuracy on match trials was improved in 
comparison with Experiment 1 and was also equivalent under high and low mismatch 
prevalence. These findings demonstrate that mismatch accuracy under low prevalence 
is not artificially high in Experiment 1 due to a large number of perceived mismatches 
on match trials; in Experiment 2 match accuracy was improved in this condition but 
mismatch detection was unaffected. The results of Experiment 2 indicate further that, 
by the simple manipulation of exposing observers to a considerable number of identity 
matches and mismatches first, false alarms on match trials can be reduced under low 
mismatch frequency. This suggests that an even match-to-mismatch ratio is 
advantageous for improving face identification in a subsequent low mismatch 
prevalence task. 

Despite these improvements, observers still made considerable identification 
errors in Experiment 2, both on match and mismatch trials and under high and low 
mismatch prevalence. This raises the question of whether the remaining identification 
errors under low prevalence can be eradicated at all, or if these errors reflect an 
absolute level of performance, which cannot be improved upon further. To investigate 
this possibility, observers were allowed to fine-tune their face identification criteria in 
the next experiment, by reflecting on all match and mismatch decisions throughout the 
duration of the task. 

Experiment 3 

In contrast to the preceding experiments, which presented face pairs onscreen, 
the face pairs were mounted on cards in Experiment 3. Observers then sorted these 
cards into match and mismatch pairs in the high and low prevalence conditions. The 
exact number of identity mismatches among these cards was revealed in advance for 
both conditions and all face pairs could be viewed repeatedly. Experiment 3 therefore 
enabled observers to establish a robust criterion for discriminating identity mismatches 
from matches, by allowing the acquisition of information from all trials simultaneously 
and the correction of possible identification errors throughout the experiment. This 
task should therefore provide an estimate of the best-possible mismatch detection 
accuracy in a face-matching task. 

Method 
Participants 

Eighteen new undergraduate students (5 males) volunteered to participate in the 
experiment, with a mean age of 20.3 years (SD = 1.2 years). All had normal or 
corrected-to-normal vision and none had participated in the preceding experiments. 

Stimuli and Procedure 

The stimuli consisted of the face pairs employed in the preceding experiments. 
These face pairs were now printed on laminated cards at a size of 21 (W) x 15 (H) cm. 
Observers were given these cards in two sets - one set for the low prevalence 
condition, followed by the set for the high prevalence condition - and were asked to 
sort each set into match and mismatch pairs. The observers were told explicitly that the 
low prevalence condition only included a single mismatch in the set of fifty cards, 
whereas the high prevalence condition included twenty-five match and twenty-five 
mismatch displays. As in previous experiments, observers were given no time limit to 
perform the task and accuracy was emphasized. 

Results 

The data in Table 1 show that performance was more accurate on match trials 
under low compared to high mismatch prevalence in this experiment, t(17) = 4.46, p < 
0.01, d = 1.36. This difference reflects the fact that, at most, participants could only 
make one match error in the low prevalence condition. At the same time, a proportion 
of participants failed to detect the solitary identity mismatch under low prevalence, by 
mistakenly identifying a matching face pair as the mismatch instead. The percentage of 
these missed mismatches is comparable to the high prevalence condition, Z = 1.00, p = 
0.32, r = 0.24. This indicates that mismatch detection is not related to prevalence in 
this experiment, even under the best-case conditions offered by the card-sorting task. 
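
The constraint on match errors under low prevalence follows from simple arithmetic, sketched below. The sketch assumes that observers, having been told that the set contains exactly one mismatch, nominate exactly one of the fifty cards as the mismatch; the function name and return structure are illustrative.

```python
N_CARDS = 50
N_MISMATCH = 1
N_MATCH = N_CARDS - N_MISMATCH  # 49 match cards


def low_prevalence_outcome(picked_true_mismatch):
    """Outcome when exactly one card is sorted into the mismatch pile."""
    if picked_true_mismatch:
        # All 49 match cards are correctly left in the match pile.
        return {"match_accuracy": 1.0, "mismatch_detected": True}
    # One match card was wrongly nominated, and the true mismatch was missed.
    return {"match_accuracy": (N_MATCH - 1) / N_MATCH, "mismatch_detected": False}


worst_case = low_prevalence_outcome(picked_true_mismatch=False)
print(round(100 * worst_case["match_accuracy"], 1))  # percentage of match cards correct
```

Under this assumption, match accuracy in the low prevalence condition can only take two values, so the comparison with match accuracy under high prevalence is constrained by the task rather than by observers' ability alone.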

To further determine the effect of the card sorting manipulation, a direct 
comparison was also conducted between Experiments 2 and 3 (for a summary of all 
comparisons, see Table 2). This analysis shows that performance was almost 
equivalent in both experiments and did not differ on match trials, t(52) = 0.96, p = 
0.34, d = 0.28, and noncritical mismatch trials under high prevalence, t(52) = 0.76, p = 
0.45, d = 0.22. Similarly, this cross-experiment comparison shows that performance 
was highly similar for the critical mismatches in the low prevalence, Z = 0.33, p = 
0.74, r = 0.05, and high prevalence conditions, Z = 0.91, p = 0.36, r = 0.12, of 
Experiments 2 and 3. These comparisons suggest that face matching errors do not arise 
from an unrefined match-mismatch criterion in the preceding experiments, but reflect a 
difficulty in face identification that persists even when observers can re-check 
identification decisions. 

Discussion 

This experiment replicates the main findings of Experiments 1 and 2, by showing that 
mismatch detection was at least as accurate under low prevalence as in the high 
prevalence condition. In Experiment 3, this was found with a card-sorting task in 
which observers could view and compare all face pairs repeatedly. This allowed 
observers to re-check previous matching decisions continuously and, as a result, errors 
on match trials were kept to a minimum under low mismatch prevalence. Experiment 3 
therefore provides the strongest evidence yet that the absence of a low prevalence 
disadvantage cannot be attributed to falsely perceived identity mismatches on match 
trials artificially boosting the number of perceived mismatches in this prevalence 
condition. 

Importantly, however, matching accuracy was still not perfect in this task. Face 
identification remained error prone under high and low prevalence, and was highly 
comparable to Experiment 2. These residual errors suggest that perfectly accurate 
identification of unfamiliar faces is not possible in these tasks, even when observers 
are informed of the correct number of identity matches and mismatches in advance and 
can compare all stimuli directly. The residual errors therefore appear to reflect an 
absolute level of performance that cannot be improved upon. This raises the question 
of whether, despite the fact that perfect identification accuracy was not possible in the 
preceding experiments, these tasks failed to capture any effects of mismatch frequency 
because observers were effectively operating at ceiling. To explore this issue, the final 
experiment examined the effect of mismatch frequency under increased task difficulty. 

Experiment 4 

The final experiment examined whether low mismatch frequency impairs 
mismatch detection when the conditions for face identification are compromised. The 
preceding experiments provide a best-case scenario for measuring mismatch 
identification, by presenting high-quality same-day photographs of full-face views, and 
by allowing observers to study the face pairs until a response is made. While observers 
were error-prone even under these favourable viewing conditions, matching 
performance was generally high and comparable across all experimental conditions. It 
is therefore conceivable that the discrimination between match and mismatch pairs was 
not sufficiently demanding to capture an effect of mismatch frequency. 

To investigate this possibility, the next experiment employed an identical 
design to Experiment 2, but the display time of the face stimuli was limited. This 
manipulation increases task difficulty by restricting the viewing time to discriminate 
match from mismatch face pairs (for processing limits in face processing, see also 
Bindemann, Burton, & Jenkins, 2005; Bindemann, Jenkins, & Burton, 2007; Jenkins, 
Lavie, & Driver, 2003). If mismatch frequency affects face identification under more 
challenging viewing conditions, then this manipulation should reveal differences in 
mismatch detection between the high and low prevalence conditions. 

Method 
Participants 

Thirty-six undergraduate students (12 males) from the University of Essex 
volunteered to participate in the experiment. The participants had a mean age of 28.1 
years (SD = 7.0 years). All had normal or corrected to normal vision and none had 
participated in the preceding experiments. 

Stimuli and Procedure 

The stimuli and procedure were identical to Experiment 2, except that the 
display time for all face pairs was now limited to 1 second. In Experiments 1 and 2, 
match and mismatch decisions required on average more than 5 seconds (see results 
sections). A display time of 1 second should therefore compromise matching 
performance by limiting information acquisition from the face pairs. 

Results 

The data for Experiment 4 are displayed in Table 1. Under high mismatch 
prevalence, accuracy on match and mismatch trials was lower than in Experiment 2. A 
cross-experiment comparison shows that these differences are substantial for match, 
t(70) = 3.37, p < 0.01, d = 0.79, and noncritical mismatch trials, t(70) = 5.65, p < 0.01, 
d = 1.33. This confirms that the limited display time was successful in reducing face 
matching accuracy in this experiment. 
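
The effect sizes for this cross-experiment comparison can be checked against the reported t values. The following is a minimal sketch rather than the analysis script used for the study. It assumes that d was obtained from an independent-samples t statistic as d = t * sqrt(1/n1 + 1/n2), and that Experiments 2 and 4 each tested 36 observers (36 are reported for Experiment 4; the number for Experiment 2 is inferred from the degrees of freedom listed in Table 2). The function name is hypothetical.

```python
import math


def cohens_d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d.

    Assumed conversion: d = t * sqrt(1/n1 + 1/n2).
    """
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


# Assumed group sizes (see the lead-in above).
n_exp2, n_exp4 = 36, 36

# t values as reported for the Experiment 2 vs. Experiment 4 comparison.
reported_t = {"match": 3.37, "noncritical mismatch": 5.65}

for label, t in reported_t.items():
    d = cohens_d_from_t(t, n_exp2, n_exp4)
    print(f"{label}: df = {n_exp2 + n_exp4 - 2}, d = {d:.2f}")
```

Under these assumptions, the sketch reproduces the reported values of d = 0.79 for match trials and d = 1.33 for noncritical mismatch trials, with 70 degrees of freedom.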

The detection of critical mismatches was also lower in this experiment in 
comparison with Experiment 2, both under high, Z = 2.14, p < 0.05, r = 0.25, and low 
mismatch prevalence, Z = 2.37, p < 0.05, r = 0.28. However, despite the overall 
decrease in identification accuracy, the critical mismatches were detected equally often 
under low and high prevalence, Z = 0.24, p = 0.81, r = 0.04. This suggests that the 
similar performance for these conditions in the preceding experiments does not reflect 
a ceiling effect, but that performance remains comparable when task difficulty is 
increased (for a summary of all comparisons, see Table 2). 
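
The statistics for the critical mismatches can be related to one another with a few lines of arithmetic. The sketch below is illustrative only and rests on three assumptions that are not stated in the text: that r was computed as Z divided by the square root of the number of observers, that the p values are two-tailed normal probabilities, and that each observer contributed a single critical mismatch trial per condition, so that accuracy is a binary score. The counts of 25 and 26 correct observers out of 36 are inferred from the percentages in Table 1, and all function names are hypothetical.

```python
import math


def two_tailed_p(z):
    """Two-tailed probability of a standard normal deviate at least as extreme as z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def effect_r(z, n_observers):
    """Effect size r = |Z| / sqrt(N) (assumed conversion)."""
    return abs(z) / math.sqrt(n_observers)


def binary_percent_sd(n_correct, n_observers):
    """Sample SD (in percent) of 0/1 accuracy scores, one score per observer."""
    p = n_correct / n_observers
    return 100.0 * math.sqrt(p * (1.0 - p) * n_observers / (n_observers - 1))


# Low versus high prevalence within Experiment 4 (36 observers).
print(f"p = {two_tailed_p(0.24):.2f}, r = {effect_r(0.24, 36):.2f}")

# Experiment 2 versus Experiment 4 (assumed 36 + 36 observers).
print(f"high: r = {effect_r(2.14, 72):.2f}, low: r = {effect_r(2.37, 72):.2f}")

# Inferred counts of correct observers in Experiment 4.
for label, n_correct in [("low prevalence", 25), ("high prevalence", 26)]:
    mean = 100.0 * n_correct / 36
    sd = binary_percent_sd(n_correct, 36)
    print(f"{label}: {mean:.1f} ({sd:.1f})")
```

If these assumptions hold, the large standard deviations for critical mismatches in Table 1 follow directly from the binary nature of the measure, and the difference between the two prevalence conditions in Experiment 4 amounts to a single observer.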

Once again, the mean correct response times were also analyzed for 
completeness. Under low mismatch prevalence, match decisions were made in 1.4 sec 
(SD = 0.4) and critical mismatches were classified in 1.5 sec (SD = 0.3). In the high 
prevalence condition, match pairs were classified in 1.6 sec (SD = 0.3), mismatch pairs 
in 2.1 sec (SD = 2.2), and critical mismatch pairs in 1.6 sec (SD = 0.4). Paired t-tests 
showed that match decisions were faster under low mismatch prevalence, t(35) = 3.91, 
p < 0.01, d = 0.66. The differences between the two critical mismatch conditions, and 
between match and mismatch pairs under high mismatch prevalence, were not 
significant, t(16) = 1.13, p = 0.27, d = 0.28, and t(35) = 1.30, p = 0.20, d = 0.27, 
respectively. 

Discussion 

Experiment 4 replicates the important aspects of the preceding experiments, but 
under less-than-optimal viewing conditions that reduce overall accuracy by limiting 
the display time of the face stimuli. This was particularly evident for critical 
mismatches for which performance was reduced in comparison with the non-critical 
mismatches, and is in line with the lower similarity ratings for the latter stimuli (see 
Experiment 1). Despite this, however, no difference in the detectability of the critical 
mismatches was found between the high and low prevalence conditions. This finding 
converges with the preceding experiments to show that identity mismatch detection 
appears unaffected by the frequency with which these face pairs appear. 

A feature of these results that merits some further discussion is that the viewing 
time manipulation of Experiment 4 appeared to impair accuracy on match trials under 
low prevalence less than block order in Experiment 1 (see Table 1). A direct 
comparison between these experiments is not made here as the relative magnitude of 
these effects presumably depends on the specific time constraints that are applied. 
However, these differences do suggest that limiting the number of identity mismatches 
that observers are initially exposed to in an experiment might impair some aspects of 
face matching more than limiting viewing time. 

Both of these observations are, of course, valuable in their own right. Limiting 
viewing time in this manner demonstrates that hasty face identification only serves to 
reduce accuracy. In security settings, such as passport control, that routinely require 
the verification of a large number of face identities, this may impact on the reliability 
of person identification and it may prove important to further determine the time 
course of unfamiliar face matching. Concerning the block order of the frequency 
conditions, on the other hand, the difference between Experiment 1 and the other 
experiments reported here suggests that training observers under high mismatch 
prevalence might improve security tasks requiring match-mismatch decisions. 

General Discussion 

The central aim of the experiments reported here was to examine the role of 
frequency in the detection of face identity mismatches. This was achieved by 
comparing matching performance for a solitary identity mismatch in the context of 
many face matches, with conditions in which matches and mismatches occur with 
equal frequency. In all experiments, performance under low mismatch frequency was 
surprisingly accurate and was similar to the high prevalence condition, in which 
identity mismatches occurred regularly. Moreover, in Experiment 1, in which the low 
prevalence task preceded the high prevalence task, mismatch detection was actually 
more accurate in the former condition. These findings are important for demonstrating 
that previous studies of unfamiliar face identification, in which matches and 
mismatches occurred with equal frequency, do not underestimate the difficulty of 
identity mismatch detection (e.g., Burton et al., 2010; Kemp et al., 1997; Megreya & 
Bindemann, 2009). 

These results were observed here with a matching task designed to offer 
idealized viewing conditions, by showing high-quality full-face photographs of faces 
and by minimizing any memory demands in Experiments 1 and 2 (see, e.g., Megreya & 
Burton, 2006). Experiment 3 optimized viewing conditions further, by employing a 
card-sorting task that allowed observers to study all face pairs repeatedly before 
finalizing their identification decisions. This manipulation rules out the possibility that 
mismatch detection was enhanced superficially under low prevalence as a result of the 
number of perceived identity mismatches on match encounters. In Experiment 3, the 
proportion of these perceived mismatches was minimized but the detection of genuine 
mismatches remained error prone. The current study also rules out that these results 
reflect a ceiling effect in unfamiliar face identification, whereby the task demands 
were not sufficiently challenging to capture an effect of mismatch frequency. In 
Experiment 4, task difficulty was increased by limiting the display time of the face 
stimuli. This manipulation reduced overall accuracy, but did not affect the relative 
proficiency with which observers were able to identify face mismatches under high 
and low prevalence. 

These findings suggest that matching face pairs for identity differs from other 
tasks, such as visual search, in which low target frequency impairs target detection 
(Wolfe et al., 2005; Wolfe et al., 2007; Wolfe & Van Wert, 2010). In visual search, 
observers respond as soon as a pre-defined target is found in a stimulus display. 
However, observers also require a quitting threshold to decide when to stop searching 
when a target may not be present. The risk here is that this threshold is lowered when 
targets occur infrequently, leading observers to quit in less than the required time to 
find a target (Wolfe et al., 2005). Target-present responses are therefore made with 
greater certainty than target-absent responses, which depend on an observer's 
judgement of how quickly the targets are generally found. 

Face matching presents a problem that poses different challenges. Observers do 
not search for the target per se, but have to search for visual information that is either 
shared by two target faces (an identity match) or discriminates one face from the other 
(a mismatch). The exact nature of this information cannot be pre-defined as clearly as 
the target characteristics in visual search (see, e.g., Burton, Jenkins, Hancock, & 
White, 2005; Megreya & Burton, 2006), so neither a match nor a mismatch response can 
be associated with the same degree of certainty. We suggest, therefore, that face 
identification is less susceptible to target frequency because the visual information 
required to perform this task can vary considerably from face to face, as well as from 
matches to mismatches, and may require a case-by-case analysis. 

It is notable, however, that mismatch frequency appears to affect at least some 
aspects of face identification. In Experiment 1, the low prevalence condition was 
presented prior to the high prevalence block. Under these conditions, the critical 
mismatches were, in fact, detected less frequently under high prevalence, while 
performance was also impaired on match trials under low prevalence. In all subsequent 
experiments, this pattern was adjusted by presenting the high prevalence condition 
first, leading to similarly high accuracy across the experimental conditions. A possible 
explanation for these differences is that observers possess inadequate criteria for 
differentiating identity matches and mismatches at the beginning of an experiment. 
Therefore, when a large number of successive identity matches are shown initially, the 
defining differences between identity matches and mismatches cannot be established 
clearly, which might then lead to a more lenient match-mismatch criterion. In line with 
this reasoning, one might expect that a greater proportion of match pairs is classified 
erroneously as identity mismatches under such conditions, as was observed under low 
prevalence in Experiment 1. In turn, this should also increase the probability that 
higher-similarity mismatches are perceived as identity matches, leading to the reduced 
accuracy for critical mismatches under high prevalence in the same experiment. 

This explanation draws some support from the finding that eyewitnesses tend 
to make relative judgement decisions in identity line-ups, so that observers often 
simply select the person who resembles a previously-encountered culprit most (see, 
e.g., Wells, 1984). This can be particularly problematic for line-ups in which the 
culprit is absent, as some line-up members will inevitably bear a relatively closer 
resemblance to the culprit than others and are therefore more likely to be chosen. 
Further support comes from the observation that witnesses tend to apply more liberal 
identification criteria to culprit-absent identity line-ups and, consequently, the innocent 
foils in such line-ups may meet the identification criteria for the culprit more easily 
(see Flowe & Ebbesen, 2007; Wells & Olson, 2003). If the same principles apply to 
face matching, then one might expect a similar pattern to Experiment 1, whereby low 
mismatch prevalence leads to an increased perception that identity matches are 
mismatches and, also, the false acceptance of high-similarity mismatches as identity 
matches under high prevalence. 

At present, this explanation is clearly speculative and sits uneasily with reports 
that performance on match and mismatch trials appears to be driven by dissociable 
factors in face identification (see, e.g., Bruce et al., 1999; Megreya & Burton, 2006; 
Megreya & Burton, 2007). The current findings appear to converge with this evidence 
too, by showing that mismatch detection is largely unaffected by the context of the 
match-to-mismatch ratio (in Experiments 2-4). One way to reconcile these findings 
could be that observers apply dissociable criteria to detect identity matches and 
mismatches, but require some initial exposure to both for honing these decisions. If 
such a "calibration" period is necessary, then it is still unclear how any criteria are 
established in the first place, as observers were provided with no feedback on their 
accuracy, and how many trials are required to do so. These are important theoretical 
questions and we can only provide a starting point here. 

From an applied perspective, on the other hand, the current experiments give 
rise to several insights that may be valuable for person identification in everyday 
settings. The most important finding derives from the observation that mismatch 
detection was not reduced under low prevalence in any of the experiments. This 
suggests that previous studies that have not manipulated mismatch frequency (e.g., 
Bruce et al., 1999; Burton et al., 2010; Davis & Valentine, 2009; Kemp et al., 1997; 
Megreya & Burton, 2006), may nonetheless provide adequate estimates of face 
identification accuracy for applied settings in which identity mismatches occur only 
infrequently. These settings include a range of important routine tasks such as passport 
control at national borders and airports, criminal identification by police, or age 
verification in liquor stores. 

A potential caveat to this conclusion is that face identification was assessed 
with a lab-based scenario, in which observers matched pairs of photographs. Face 
identification in security settings, on the other hand, involves matching the appearance 
of a face photograph to a live person. This situation can give rise to noticeably poorer 
matching performance, suggesting that additional factors influence the accuracy of 
live-to-photo identification (Kemp et al., 1997). However, direct comparisons between 
live-to-photo and photo-to-photo matching, under carefully controlled conditions, 
indicate that performance is in fact comparable in both tasks (Megreya & Burton, 
2008). We suggest, therefore, that the absence of an infrequent mismatch disadvantage 
in the experiments reported here will also generalize to live-to-photo matching in 
everyday security settings. 

At the same time, however, the present results also underline existing evidence 
that unfamiliar person identification is an error-prone process that must be 
implemented with caution (see, e.g., Costigan, 2007; Jenkins & Burton, 2008; Wells & 
Olson, 2003). In the current study, this is particularly striking in the card-sorting task 
of Experiment 3. This task provides a best-case scenario for unfamiliar face 
identification but, even so, perfect accuracy was not possible. This concern is 
heightened by the observation that face-matching accuracy declined further when 
viewing time was limited in Experiment 4. In settings that require the fast routine 
verification of a large number of face identities, this may further reduce the reliability 
of person identification. 






References 

Behrman, B. W., & Davey, S. L. (2001). Eyewitness identification in actual criminal 
cases: an archival analysis. Law & Human Behavior, 25, 475-491. 

Bindemann, M., Burton, A. M., & Jenkins, R. (2005). Capacity limits for face 
processing. Cognition, 98, 177-197. 

Bindemann, M., Jenkins, R., & Burton, A. M. (2007). A bottleneck in face 
identification: repetition priming from flanker faces. Experimental Psychology, 
54, 192-201. 

Bruce, V., Henderson, Z., Greenwood, K., Hancock, P. J. B., Burton, A. M., & Miller, 
P. (1999). Verification of face identities from images captured on video. Journal 
of Experimental Psychology: Applied, 5, 339-360. 

Bruce, V., Henderson, Z., Newman, C., & Burton, A. M. (2001). Matching identities 
of familiar and unfamiliar faces caught on CCTV images. Journal of 
Experimental Psychology: Applied, 7, 305-327. 

Burton, A. M., Jenkins, R., Hancock, P. J. B., & White, D. (2005). Robust 
representations for face recognition: The power of averages. Cognitive 
Psychology, 51, 256-284. 

Burton, A. M., Wilson, S., Cowan, M., & Bruce, V. (1999). Face recognition in poor- 
quality video: Evidence from security surveillance. Psychological Science, 10, 
243-248. 

Burton, A. M., White, D., & McNeill, A. (2010). The Glasgow Face Matching Test. 
Behavior Research Methods, 42, 286-291. 

Clutterbuck, R., & Johnston, R. A. (2002). Exploring levels of face familiarity by 
using an indirect face-matching measure. Perception, 31, 985-994. 

Costigan, R. (2007). Identification from CCTV: The risk of injustice. Criminal Law 
Review, 591-608. 

Davis, J. P., & Valentine, T. (2009). CCTV on trial: Matching video images with the 
defendant in the dock. Applied Cognitive Psychology, 23, 482-505. 

Flowe, H. D., & Ebbesen, E. B. (2007). The effect of lineup member similarity on 
recognition accuracy in simultaneous and sequential lineups. Law & Human 
Behavior, 31, 33-52. 

Hancock, P. J. B., Bruce, V., & Burton, A. M. (2000). Recognition of unfamiliar faces. 
Trends in Cognitive Sciences, 4, 330-337. 

Henderson, Z., Bruce, V., & Burton, A. M. (2001). Matching the faces of robbers 
captured on video. Applied Cognitive Psychology, 15, 445-464. 

Jenkins, R., & Burton, A. M. (2008). Limitations in facial identification: The 
Evidence. Justice of the Peace, 172, 4-6. 

Jenkins, R., Lavie, N., & Driver, J. (2003). Ignoring famous faces: Category-specific 
dilution of distractor interference. Perception & Psychophysics, 65, 298-309. 

Kemp, R., Towell, N., & Pike, G. (1997). When seeing should not be believing: 
Photographs, credit cards and fraud. Applied Cognitive Psychology, 11, 211-222. 

Lane, S. M., & Meissner, C. A. (2008). A "middle" road approach to bridging the 
basic-applied divide in eyewitness identification research. Applied Cognitive 
Psychology, 22, 779-787. 

Levin, D. T., & Simons, D. J. (1997). Failure to detect changes to attended objects in 
motion pictures. Psychonomic Bulletin & Review, 4, 501-506. 

Lindsay, R. C. L., & Pozzulo, J. D. (1999). Sources of eyewitness identification errors. 
International Journal of Law and Psychiatry, 22, 347-360. 

Malpass, R. S., & Devine, P. G. (1981). Eyewitness identification: Lineup instructions 
and the absence of the offender. Journal of Applied Psychology, 66, 482-489. 

Megreya, A. M., & Bindemann, M. (2009). Revisiting the processing of internal and 
external features of unfamiliar faces: The headscarf effect. Perception, 38, 1831- 
1848. 

Megreya, A. M., & Burton, A. M. (2006). Unfamiliar faces aren't faces: Evidence 
from a matching task. Memory & Cognition, 34, 865-876. 

Megreya, A. M., & Burton, A. M. (2007). Hits and false positives in face matching: A 
familiarity based dissociation. Perception & Psychophysics, 69, 1175-1184. 

Megreya, A. M., & Burton, A. M. (2008). Matching faces to photographs: poor 
performance in eyewitness memory (without the memory). Journal of 
Experimental Psychology: Applied, 14, 364-372. 

Memon, A., & Gabbert, F. (2003). Improving the identification accuracy of senior 
witnesses: Do pre-lineup questions and sequential testing help? Journal of 
Applied Psychology, 88, 341-347. 

Memon, A., Vrij, A., & Bull, R. (2003). Psychology and law: Truthfulness, accuracy 
and credibility (2nd ed.). Chichester, England: Wiley. 

Searcy, J. H., Bartlett, J. C., & Memon, A. (1999). Age differences in accuracy and 
choosing in eyewitness identification and face recognition. Memory & 
Cognition, 27, 538-552. 

Simons, D. J., & Levin, D. T. (1998). Failure to detect changes to people in real-world 
interaction. Psychonomic Bulletin & Review, 5, 644-649. 

Valentine, T., & Mesout, J. (2009). Eyewitness identification under stress in the 
London Dungeon. Applied Cognitive Psychology, 23, 151-161. 

Wells, G. L. (1984). The psychology of lineup identifications. Journal of Applied 
Social Psychology, 14, 89-103. 

Wells, G. L., & Olson, E. (2003). Eyewitness identification. Annual Review of 
Psychology, 54, 277-295. 

Westcott, H., & Brace, N. (2002). Psychological factors in witness evidence and 
identification. In N. Brace & H. Westcott (Eds.), Applying Psychology (pp. 117- 
178). Milton Keynes, England: Open University. 

Wolfe, J. M., Horowitz, T. S., & Kenner, N. M. (2005). Rare items missed in visual 
searches. Nature, 435, 439-440. 

Wolfe, J. M., Horowitz, T. S., Van Wert, M., Kenner, N. M., Place, S. S., & Kibbi, N. 
(2007). Low target prevalence is a stubborn source of error in visual search 
tasks. Journal of Experimental Psychology: General, 136, 623-638. 

Wolfe, J. M., & Van Wert, M. (2010). Varying target prevalence reveals two 
dissociable decision criteria in visual search. Current Biology, 20, 121-124. 

Yarmey, A. D. (2004). Eyewitness recall and photo identification: A field experiment. 
Psychology, Crime, & Law, 10, 53-68. 






FIGURE 1. Examples of the face pairs employed in the matching tasks in 
Experiments 1 to 4, depicting a critical identity mismatch (A) and an identity 
match (B). 






TABLE 1. Mean percentage accuracy (and standard deviations) for the high 
and low prevalence conditions in Experiments 1 to 4 for the critical identity 
mismatches (Crit Mismatch), noncritical mismatches (Ncrit Mismatch) and for 
matching face pairs (Match). 

               Crit Mismatch                      Ncrit Mismatch    Match
               Low Prevalence   High Prevalence   High Prevalence   Low Prevalence   High Prevalence
Experiment 1   92.6 (26.4)      75.9 (43.2)       88.3 (12.2)       74.2 (17.8)      87.6 (12.8)
Experiment 2   91.7 (28.0)      91.7 (28.0)       95.6 (7.0)        88.8 (14.4)      90.4 (8.6)
Experiment 3   88.9 (32.3)      83.3 (38.3)       94.2 (4.6)        99.8 (0.7)       92.7 (6.8)
Experiment 4   69.4 (46.7)      72.2 (45.4)       80.4 (14.5)       85.5 (13.6)      81.3 (13.8)



TABLE 2. A summary of comparisons for Experiments 1 to 4. 

               Crit Mismatch                  Match                              High Prevalence
               Low versus High Prevalence     Low versus High Prevalence         Match versus Ncrit Mismatch
Experiment 1   Z = 2.32, p < 0.05, r = 0.32   t(53) = 5.85, p < 0.01, d = 0.82   t(53) = 0.31, p = 0.76, d = 0.05
Experiment 2   Z = 0.00, p = 1.00, r = 0.00   t(35) = 0.77, p = 0.45, d = 0.13   t(35) = 2.71, p < 0.05, d = 0.46
Experiment 3   Z = 1.00, p = 0.32, r = 0.24   t(17) = 4.46, p < 0.01, d = 1.36   t(17) = 1.37, p = 0.19, d = 0.34
Experiment 4   Z = 0.24, p = 0.81, r = 0.04   t(35) = 1.35, p = 0.18, d = 0.23   t(35) = 0.29, p = 0.77, d = 0.05

                   Experiment 1 vs. Experiment 2      Experiment 2 vs. Experiment 3      Experiment 2 vs. Experiment 4
Low Prevalence
  Crit Mismatch    Z = 0.16, p = 0.87, r = 0.02       Z = 0.33, p = 0.74, r = 0.05       Z = 2.37, p < 0.05, r = 0.28
  Match            t(88) = 4.09, p < 0.01, d = 0.88   t(52) = 3.23, p < 0.01, d = 0.93   t(70) = 1.00, p = 0.32, d = 0.24
High Prevalence
  Crit Mismatch    Z = 1.90, p = 0.06, r = 0.20       Z = 0.91, p = 0.36, r = 0.12       Z = 2.14, p < 0.05, r = 0.25
  Ncrit Mismatch   t(88) = 3.26, p < 0.01, d = 0.70   t(52) = 0.76, p = 0.45, d = 0.22   t(70) = 5.65, p < 0.01, d = 1.33
  Match            t(88) = 1.16, p = 0.25, d = 0.25   t(52) = 0.96, p = 0.34, d = 0.28   t(70) = 3.37, p < 0.01, d = 0.79



